Week 5

dir <- "~/work/courses/stat380/weeks/week-5"
setwd(dir)
library(renv)
# renv::activate()
renv::restore()

Attaching package: ‘renv’


The following objects are masked from ‘package:stats’:

    embed, update


The following objects are masked from ‘package:utils’:

    history, upgrade


The following objects are masked from ‘package:base’:

    autoload, load, remove

* The library is already synchronized with the lockfile.

Agenda:

  1. Interpretation of regression coefficients
  2. Categorical covariates
  3. Multiple regression
    • Extension from simple linear regression
    • Other topics

Packages we will require this week

library(ISLR2)
library(dplyr)
library(tidyr)
library(readr)
library(knitr)
library(ggplot2)
library(cowplot)

Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Tue, Feb 7

Recall the simple linear regression model:

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

where:

  1. \(y_i\) is the response
  2. \(x_i\) is the covariate
  3. \(\epsilon_i\) is the error (vertical black line in lecture 4 notes)
  4. \(\beta_0\) and \(\beta_1\) are the regression coefficients
  5. \(i = 1, 2, \dots, n\) are the indices for the observations

Can anyone tell me the interpretation for the regression coefficients?

\(\beta_0\) is the intercept and \(\beta_1\) is the slope.

Let’s consider the following example using mtcars

library(ggplot2)
mtcars %>% head()
A data.frame: 6 × 11
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Consider the following relationship

x <- mtcars$hp
y <- mtcars$mpg

plot(x, y, pch=20, xlab="HP", ylab="MPG")

model <- lm(y~x)
summary(model)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.7121 -2.1122 -0.8854  1.5819  8.2360 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
x           -0.06823    0.01012  -6.742 1.79e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.863 on 30 degrees of freedom
Multiple R-squared:  0.6024,    Adjusted R-squared:  0.5892 
F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

For the intercept this means that:

A hypothetical car with \(\texttt{hp} = 0\) has an expected \(\texttt{mpg}\) of \(30.099 = \beta_0\). (Note this is an extrapolation: no car in the data has horsepower anywhere near 0.)
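A quick sanity check of the intercept interpretation (refitting the same model with named columns, purely for self-containment): the fitted value at hp = 0 is exactly the intercept estimate.

```r
# Same simple regression as above, written with named columns
model <- lm(mpg ~ hp, data = mtcars)

# Predicting at hp = 0 recovers the intercept estimate
predict(model, newdata = data.frame(hp = 0))
coef(model)[1]
```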

It’s more interesting and instructive to consider the interpretation of the slope:

Let’s say we have some covariate value \(x_0\); then the expected value \(y(x_0)\) is given by

\[y(x_0) = \beta_0 + \beta_1 x_0\]

What’s the expected value for \(x_0 + 1\)?

\[ \begin{align} y(x_0 + 1) &= \beta_0 + \beta_1 \times (x_0 + 1)\\ \\ &= \beta_0 + \beta_1 x_0 + \beta_1\\ \\ &= y(x_0) + \beta_1\\ \\ \implies \beta_1 &= y(x_0 + 1) - y(x_0) \end{align} \]
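The derivation above can be verified numerically with the mtcars fit: the difference between predictions at \(x_0\) and \(x_0 + 1\) equals the slope estimate, for any choice of \(x_0\) (the value 100 below is arbitrary).

```r
model <- lm(mpg ~ hp, data = mtcars)   # same fit as above, with named columns
x0 <- 100                              # arbitrary starting value

preds <- predict(model, newdata = data.frame(hp = c(x0, x0 + 1)))

# y(x0 + 1) - y(x0) reproduces the slope coefficient (about -0.068)
diff(preds)
coef(model)["hp"]
```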

Categorical covariates

Up until now, we have looked at simple linear regression models where both \(x\) and \(y\) are quantitative. What if the covariate is categorical? In mtcars, the cyl column is stored as a number but takes only a handful of distinct values.

Let’s confirm that cyl is indeed categorical:

unique(mtcars$cyl)
  1. 6
  2. 4
  3. 8
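Since cyl only takes the values 4, 6, and 8, it behaves like a category even though R stores it as numeric. One way to treat it categorically (a sketch of what we formalize next) is to wrap it in factor() inside the formula:

```r
# factor() makes lm() fit one coefficient per category
# instead of a single numeric slope for cyl
lm(mpg ~ factor(cyl), data = mtcars)
```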

Another example we have is with the iris dataset:

iris %>% head()
A data.frame: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Let’s consider the following example:

We want to see if there is a relationship between Species and Sepal.Length. How would we start the EDA?

y <- iris$Sepal.Length
x <- iris$Species

boxplot(Sepal.Length ~ Species, iris, col=c("red", "green", "blue"))

Let’s just run a linear regression model and see what the model output is going to look like:

cat_model <- lm(Sepal.Length ~ Species, iris)
cat_model

Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Coefficients:
      (Intercept)  Speciesversicolor   Speciesvirginica  
            5.006              0.930              1.582  

Even if \(x\) is categorical, we can still write down the regression model as follows:

\[ y_i = \beta_0 + \beta_1 \cdot \mathbf{1}(x_i = \texttt{versicolor}) + \beta_2 \cdot \mathbf{1}(x_i = \texttt{virginica}) + \epsilon_i \]

where \(x_i \in \{ \texttt{setosa}, \ \texttt{versicolor}, \ \texttt{virginica} \}\) and \(\mathbf{1}(\cdot)\) is the indicator function. Since the indicators switch on and off depending on the category, we end up with (fundamentally) three different mean models:

  1. \(y_i = \beta_0\) when \(x_i = \texttt{setosa}\) (the baseline category)
  2. \(y_i = \beta_0 + \beta_1\) when \(x_i = \texttt{versicolor}\)
  3. \(y_i = \beta_0 + \beta_2\) when \(x_i = \texttt{virginica}\)

Now, the interpretations of the coefficients are as follows:

Intercept

\(\beta_0\) is the expected \(y\) value when \(x\) belongs to the base category. This is what the intercept is capturing.

Slopes

\(\beta_1\), reported under the name Speciesversicolor, represents the following:

(Intercept) = \(y(x = \texttt{setosa})\)

Speciesversicolor = \(y(x = \texttt{versicolor}) - y(x = \texttt{setosa})\)

Speciesvirginica = \(y(x = \texttt{virginica}) - y(x = \texttt{setosa})\)
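These interpretations are easy to check directly: with a single categorical covariate, the coefficients are just group means measured relative to the baseline.

```r
fit <- lm(Sepal.Length ~ Species, data = iris)
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Intercept = mean for the baseline (setosa); the other coefficients
# are differences of group means from that baseline
coef(fit)[1]                                        # 5.006
group_means["versicolor"] - group_means["setosa"]   # 0.930
group_means["virginica"]  - group_means["setosa"]   # 1.582
```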

Reordering the factors

Let’s say that we didn’t want setosa to be the baseline level, and, instead, we wanted virginica to be the baseline level. How would we do this?

First, we’re going to reorder/relevel the categorical covariate

summary(iris$Species) # before

iris2 <- iris %>% 
    mutate(Species = relevel(Species, "virginica"))

summary(iris2$Species) # after

    setosa versicolor  virginica 
        50         50         50 

 virginica     setosa versicolor 
        50         50         50 

Note that we must assign the result of mutate() (here, to iris2); otherwise the releveling is silently discarded.

Once we do the releveling, we can now run the regression model:

new_cat_model <- lm(Sepal.Length ~ Species, iris2)
new_cat_model

Call:
lm(formula = Sepal.Length ~ Species, data = iris2)

Coefficients:
      (Intercept)      Speciessetosa  Speciesversicolor  
            6.588             -1.582             -0.652  

The intercept is now the mean sepal length for virginica, and the other coefficients are differences from that new baseline.







Thu, Feb 9

Packages for today:

library(ISLR2)
# library(plotly)
library(purrr)

Attaching package: ‘purrr’


The following object is masked from ‘package:renv’:

    modify

Multiple Regression

This is the extension of simple linear regression to multiple covariates \(X = [x_1 | x_2 | \dots | x_p]\), i.e.,

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon \]

In particular, the data looks like the following:

\(\mathbf y\) \(\mathbf x_1\) \(\mathbf x_2\) \(\dots\) \(\mathbf x_p\)
\(y_{1}\) \(x_{1,1}\) \(x_{2,1}\) \({...}\) \(x_{p,1}\)
\(y_{2}\) \(x_{1,2}\) \(x_{2,2}\) \({...}\) \(x_{p,2}\)
\(y_{3}\) \(x_{1,3}\) \(x_{2,3}\) \({...}\) \(x_{p,3}\)
\({\vdots}\) \({\vdots}\) \({\vdots}\) \(\ddots\) \(\vdots\)
\(y_{n}\) \(x_{1,n}\) \(x_{2,n}\) \({...}\) \(x_{p,n}\)

and, the full description of the model is as follows:

\[ y_i = \beta_0 + \beta_1 x_{1, i} + \beta_2 x_{2, i} + \dots + \beta_p x_{p, i} + \epsilon_i, \qquad i = 1, 2, \dots, n. \]
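In matrix form this is \(y = X\beta + \epsilon\), where \(X\) is the \(n \times (p+1)\) design matrix. A quick sketch with model.matrix(), using base R's mtcars as a stand-in dataset:

```r
# model.matrix() builds the design matrix X, including the leading
# column of 1s that multiplies the intercept
X <- model.matrix(mpg ~ hp + wt, data = mtcars)
head(X, 3)
```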


Consider the Credit dataset

library(ISLR2)
library(plotly)
attach(Credit)

df <- Credit %>% tibble()
colnames(df) <- tolower(colnames(df))
df %>% head()

Attaching package: ‘plotly’


The following object is masked from ‘package:ggplot2’:

    last_plot


The following object is masked from ‘package:stats’:

    filter


The following object is masked from ‘package:graphics’:

    layout


The following objects are masked from Credit (pos = 4):

    Age, Balance, Cards, Education, Income, Limit, Married, Own,
    Rating, Region, Student

A tibble: 6 × 11
   income limit rating cards   age education   own student married region balance
  14.891  3606    283     2    34        11    No      No     Yes  South     333
 106.025  6645    483     3    82        15   Yes     Yes     Yes   West     903
 104.593  7075    514     4    71        11    No      No      No   West     580
 148.924  9504    681     3    36        11   Yes      No      No   West     964
  55.882  4897    357     2    68        16    No      No     Yes  South     331
  80.180  8047    569     4    77        10    No      No      No  South    1151

and, we’ll look at the following three columns: income, rating, limit.

df3 <- df %>% select(income, rating, limit)
df3 %>% head()
A tibble: 6 × 3
   income rating limit
  14.891    283  3606
 106.025    483  6645
 104.593    514  7075
 148.924    681  9504
  55.882    357  4897
  80.180    569  8047

If we want to see how the credit limit is related to income and credit rating, we can visualize the relationship with a 3-D scatter plot:

fig <- plot_ly(df3, x=~income, y=~rating, z=~limit)
fig %>% add_markers()

The regression model is as follows:

model <- lm(limit ~ income + rating, df3)
model

Call:
lm(formula = limit ~ income + rating, data = df3)

Coefficients:
(Intercept)       income       rating  
  -532.4711       0.5573      14.7711  

What does the regression model look like here?

ranges <- df3 %>%
    select(income, rating) %>%
    colnames() %>%
    map(\(x) seq(0.1 * min(df3[x]), 1.1 * max(df3[x]), length.out = 50))

b <- model$coefficients
z <- outer(
    ranges[[1]], 
    ranges[[2]], 
    Vectorize(function(x2, x3) {
        b[1] + b[2] * x2 + b[3] * x3
    })
)
fig %>%
    add_surface(x = ranges[[1]], y = ranges[[2]], z = t(z), alpha=0.3) %>% 
    add_markers()

What is the interpretation for the coefficients?

  1. \(\beta_0\) is the expected value of \(y\) when \(income = 0\) and \(rating = 0\)

  2. \(\beta_1\) is saying that if \(rating\) is held constant and \(income\) changes by 1 unit, then the corresponding change in the limit is \(0.5573\)

  3. \(\beta_2\) is saying that if income is held constant and rating changes by \(1\) unit, then the corresponding change in limit is \(14.771\)
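The "held constant" reading can be checked with predict(). Since the Credit data requires ISLR2, here is the same check sketched on base R's mtcars with an analogous two-covariate model (hp standing in for income, wt for rating):

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)

# Change hp by 1 unit while holding wt fixed at 3:
# the difference in predictions is exactly the hp coefficient
p <- predict(fit, newdata = data.frame(hp = c(100, 101), wt = c(3, 3)))
diff(p)
coef(fit)["hp"]
```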

What about the significance?

summary(model)

Call:
lm(formula = limit ~ income + rating, data = df3)

Residuals:
    Min      1Q  Median      3Q     Max 
-420.97 -121.77   14.97  126.72  485.48 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -532.47115   24.17283 -22.028   <2e-16 ***
income         0.55727    0.42349   1.316    0.189    
rating        14.77115    0.09647 153.124   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 182.3 on 397 degrees of freedom
Multiple R-squared:  0.9938,    Adjusted R-squared:  0.9938 
F-statistic: 3.18e+04 on 2 and 397 DF,  p-value: < 2.2e-16
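Confidence intervals tell the same story as these p-values: a coefficient is significant at the 5% level exactly when its 95% interval excludes zero (here, income's interval would contain zero while rating's would not). A sketch with confint(), applied to the earlier base-R mtcars fit since Credit needs ISLR2:

```r
fit <- lm(mpg ~ hp, data = mtcars)

# 95% confidence intervals for the coefficients; neither interval
# contains 0, consistent with both p-values being below 0.05
confint(fit, level = 0.95)
```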